Preliminaries

Load libraries

If you get an error when loading a library, go to the Packages panel in RStudio, click Install and type the name of the library (you only need to do this once).

library(MLTools)
library(fpp2)
library(readxl)
library(tidyverse)
library(lmtest)  # contains coeftest function
library(tseries) # contains adf.test function
library(TSstudio)

Set the working directory

setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

Reading time series data and EDA

Load dataset

fdata <- read_excel("ARIMA_example.xlsx")

Visualize the first rows:

head(fdata)
## # A tibble: 6 × 1
##       y
##   <dbl>
## 1  7.07
## 2  7.01
## 3  7.06
## 4  7.15
## 5  7.21
## 6  7.15

Convert it to a time series object

Assume that there are no time gaps, and use the natural numbers as index.

fdata_ts <- ts(fdata)
head(fdata_ts)
## Time Series:
## Start = 1 
## End = 6 
## Frequency = 1 
##             y
## [1,] 7.070480
## [2,] 7.014591
## [3,] 7.063700
## [4,] 7.149487
## [5,] 7.209597
## [6,] 7.147241
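As an aside, ts() also accepts start and frequency arguments for data with a known sampling period. The call below is a hypothetical sketch (this dataset has no dates, so we keep the default integer index):

```r
# Hypothetical: if fdata were monthly observations starting in January 2020,
# we would declare the seasonal period when building the ts object:
fdata_monthly <- ts(fdata, start = c(2020, 1), frequency = 12)

# For our dataset we keep the defaults: Start = 1, Frequency = 1
fdata_ts <- ts(fdata)
```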

We can get some basic information about the resulting object:

ts_info(fdata_ts)
##  The fdata_ts series is a ts object with 1 variable and 1000 observations
##  Frequency: 1 
##  Start time: 1 1 
##  End time: 1000 1

We use the column index to select a time series

y <- fdata_ts[ , 1]

Stationarity and Model Identification

ACF and PACF plots of the time series

We use the following plots to assess the stationarity of the time series. With ggtsdisplay we get a time plot and both the ACF and PACF.

ggtsdisplay(y)

Dealing with non-homogeneous variance (heteroskedasticity)

We are going to decide if we want to apply a Box-Cox transformation

Lambda <- BoxCox.lambda.plot(y, window.width = 10)
## `geom_smooth()` using formula = 'y ~ x'

# Lambda <- BoxCox.lambda(y) #other option

Below, z is the transformed time series if Box-Cox is applied.
Note: if you decide that you do not want to use Box-Cox, set useLambda <- FALSE in the next code line:

useLambda <- TRUE

if(useLambda) {
  z <- BoxCox(y, Lambda)
} else {
  z <- y
  Lambda <- 1
}
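The Box-Cox transformation is invertible, so we can always return to the original scale. As a quick sanity check (a sketch, assuming the y, z and Lambda objects defined above):

```r
library(forecast)  # provides BoxCox and InvBoxCox

# Applying the inverse transformation with the same Lambda should recover
# the original series up to floating-point error.
y_back <- InvBoxCox(z, Lambda)
max(abs(y_back - y))  # should be numerically close to 0 when useLambda is TRUE
```

This is also the mechanism Arima uses internally when we pass the lambda argument: it fits on the transformed scale and back-transforms the forecasts.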

We can plot together the original and transformed time series to see the result of the transformation.

autoplot(y, linewidth=0.25, color = "red") + 
  autolayer(z, color = "blue", alpha=0.5)

Deciding if differencing is needed

When the ACF decreases very slowly, the time series needs differencing. We analyze the ACF of the (possibly Box-Cox transformed) time series.

ggtsdisplay(z, lag.max = 25)

There are two alternative ways of checking for stationarity. The first is the augmented Dickey-Fuller (ADF) test, whose null hypothesis is that the data are non-stationary and non-seasonal (see (Hyndman and Athanasopoulos 2021), the section on unit root tests, and the video therein).

adf.test(z, alternative = "stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  z
## Dickey-Fuller = -3.3705, Lag order = 9, p-value = 0.05852
## alternative hypothesis: stationary

Check the size of the p-value. The fpp3 book also discusses the KPSS test, which we can apply by uncommenting the next code line.

# tseries::kpss.test(z, null = "Trend")

A second way: the ndiffs function uses the previous tests to suggest a differencing order that will make the time series stationary:

ndiffs(z)
## [1] 1

Either way, now you need to decide the number d of differences to apply, if any:

d <- 1

If differencing is needed we apply it here:

Bz <- if(d > 0){
  diff(z, differences = d)
} else {
  z
}
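As an optional sanity check (a sketch, assuming the Bz object defined above), we can confirm that the differenced series now behaves as stationary:

```r
library(tseries)   # adf.test

# After differencing, the ADF test should reject the unit-root null
# (small p-value), and ndiffs should suggest no further differencing.
adf.test(Bz, alternative = "stationary")
forecast::ndiffs(Bz)  # expected to return 0 if d was chosen well
```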

ARIMA model selection and fit

We use the ACF and PACF of the Bz time series to identify the significant lags and propose the ARMA structure of the model:

ggtsdisplay(Bz, lag.max = 25)

Now we propose a structure of the model by selecting the values of (p, d, q) where d is for the number of differences (already decided).

p <- 2
q <- 0

This is the chosen (p, d, q) structure:

cat(paste0(c("p", "d", "q")," = ", c(p, d, q)))
## p = 2 d = 1 q = 0

We fit the model with the above values. Observe that we apply it to the original (untransformed, undifferenced) time series y, because the Arima function includes arguments for both the Box-Cox lambda and the differencing order d (inside order).

Note: also remember that if you want a constant term you need to include it here.

arima.fit <- Arima(y,
                   order = c(p, d, q), 
                   lambda = Lambda,
                   #lambda = 1,
                   include.constant = TRUE)
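As a cross-check on the manual identification (a sketch, assuming the y and Lambda objects defined above), the auto.arima function from the forecast package searches over candidate (p, d, q) orders automatically. It is a useful benchmark, although its chosen orders need not coincide with ours:

```r
library(forecast)

# Automatic order selection on the same (Box-Cox transformed) series.
auto.fit <- auto.arima(y, lambda = Lambda)
summary(auto.fit)
```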

Model Diagnosis

Summary of training errors and estimated coefficients

summary(arima.fit) 
## Series: y 
## ARIMA(2,1,0) with drift 
## Box Cox transformation: lambda= 0.4240885 
## 
## Coefficients:
##          ar1      ar2    drift
##       0.7919  -0.4986  -0.0019
## s.e.  0.0274   0.0274   0.0015
## 
## sigma^2 = 0.001056:  log likelihood = 2006.94
## AIC=-4005.87   AICc=-4005.83   BIC=-3986.24
## 
## Training set error measures:
##                         ME       RMSE        MAE         MPE     MAPE      MASE
## Training set -0.0002542038 0.07572928 0.06016474 -0.01057471 1.481481 0.7310768
##                      ACF1
## Training set 0.0009794669

Statistical significance of estimated coefficients

If some coefficient is not significant, you should reconsider the choice of model order.

coeftest(arima.fit) 
## 
## z test of coefficients:
## 
##         Estimate Std. Error  z value Pr(>|z|)    
## ar1    0.7919024  0.0273957  28.9061   <2e-16 ***
## ar2   -0.4985500  0.0273900 -18.2019   <2e-16 ***
## drift -0.0018654  0.0014532  -1.2836   0.1993    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The following root plot is used to assess the stationarity and invertibility of the model.

autoplot(arima.fit) 

Check residuals

Again, if the residuals are not white noise, you should reconsider the choice of the ARMA model order.

CheckResiduals.ICAI(arima.fit, bins = 100, lag = 25)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,0) with drift
## Q* = 16.226, df = 22, p-value = 0.8045
## 
## Model df: 3.   Total lags used: 25

Check the p-value of the Ljung-Box test above. The null hypothesis for this test is that the residuals are independently distributed (i.e. they are white noise), so a large p-value supports the adequacy of the model.
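The same Ljung-Box test can be run directly with base R (a sketch, assuming the arima.fit object from above):

```r
# fitdf = 3 discounts the three estimated coefficients (ar1, ar2, drift)
# from the degrees of freedom of the chi-squared statistic.
Box.test(residuals(arima.fit), lag = 25, type = "Ljung-Box", fitdf = 3)
```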

Fitted values and forecast

Check the fitted values against the real ones

autoplot(y, series = "Real", size=2, alpha=0.5) +
  forecast::autolayer(arima.fit$fitted, series = "Fitted")

Perform a true (future) forecast

We need to select the prediction horizon h

y_est <- forecast(arima.fit, h=12)

The result is a forecast type object. It contains not only the forecasted values, but also uncertainty intervals and extra information about the fit. Printing the forecast object shows some of this information:

y_est
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 1001       2.632242 2.560116 2.705524 2.522401 2.744788
## 1002       2.622788 2.476332 2.774112 2.400759 2.856202
## 1003       2.613486 2.414535 2.821562 2.312873 2.935441
## 1004       2.608535 2.379704 2.849548 2.263442 2.982120
## 1005       2.606956 2.360258 2.867880 2.235351 3.011836
## 1006       2.605882 2.344692 2.883082 2.212820 3.036388
## 1007       2.603531 2.327063 2.898021 2.187900 3.061310
## 1008       2.599919 2.306830 2.913373 2.159796 3.087671
## 1009       2.595947 2.286319 2.928432 2.131516 3.113835
## 1010       2.592320 2.267474 2.942455 2.105576 3.138211
## 1011       2.589145 2.250559 2.955337 2.082299 3.160553
## 1012       2.586160 2.234825 2.967350 2.060701 3.181440
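The components of the forecast object can also be extracted programmatically (a sketch, assuming the y_est object from above):

```r
y_est$mean            # point forecasts (a ts object)
y_est$lower[, "95%"]  # lower bounds of the 95% prediction interval
y_est$upper[, "95%"]  # upper bounds of the 95% prediction interval

# Or gather everything into a data frame at once:
as.data.frame(y_est)
```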

The forecast object can also be plotted.

autoplot(y_est)

In the following plot we subset the original time series to improve the visualization of the h forecasted values.

autoplot(subset(y, start = 900)) +
  autolayer(y_est)

Additional Code

Arma model selection with train/test split and performance measures

Let us go over the same dataset, but this time we will use functions from other libraries to perform a basic train/test split and compute some performance measures.

n <- length(y)
n_test <- floor(0.2 * n)


library(TSstudio)
y_split <- ts_split(y, sample.out = n_test)

y_TS <- y_split$test
y_TR <- y_split$train
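For reference, the same split can be done with base R alone (a sketch, assuming the y, n and n_test objects defined above; since our ts uses the integer index, the time values coincide with the observation numbers):

```r
# window() subsets a ts by its time index, preserving the ts attributes.
y_TR2 <- window(y, end = n - n_test)        # first 80% of observations
y_TS2 <- window(y, start = n - n_test + 1)  # last 20% of observations
# length(y_TR2) == n - n_test; length(y_TS2) == n_test
```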

Now we will briefly go over the same model identification and fitting steps, but using only the training values.

Box-Cox

Lambda <- BoxCox.lambda.plot(y_TR, window.width = 10)
## `geom_smooth()` using formula = 'y ~ x'

useLambda <- TRUE

if(useLambda) {
  z <- BoxCox(y_TR, Lambda)
} else {
  z <- y_TR
  Lambda <- 1
}

Deciding if differencing is needed

When the ACF decreases very slowly, the time series needs differencing. We analyze the ACF of the (possibly Box-Cox transformed) time series.

ggtsdisplay(z, lag.max = 25)

ADF test for stationarity

adf.test(z, alternative = "stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  z
## Dickey-Fuller = -3.2778, Lag order = 9, p-value = 0.07451
## alternative hypothesis: stationary

Also check ndiffs

ndiffs(z)
## [1] 1

Choose value of d

d <- 1

If differencing is needed we apply it here:

Bz <- if(d > 0){
  diff(z, differences = d)
} else {
  z
}

ARIMA model selection and fit

ACF and PACF of Bz

ggtsdisplay(Bz, lag.max = 25)

Now we propose a structure (p, d, q), with d already selected above.

p <- 2
q <- 1

Fit the model

arima.fit <- Arima(y_TR,
                   order=c(p, d, q), 
                   lambda = Lambda,
                   include.constant = TRUE)

Model Diagnosis

Summary of training errors and estimated coefficients

summary(arima.fit) 
## Series: y_TR 
## ARIMA(2,1,1) with drift 
## Box Cox transformation: lambda= 0.5378993 
## 
## Coefficients:
##          ar1      ar2      ma1    drift
##       0.8550  -0.5517  -0.0455  -0.0028
## s.e.  0.0553   0.0382   0.0665   0.0019
## 
## sigma^2 = 0.001495:  log likelihood = 1466.79
## AIC=-2923.58   AICc=-2923.5   BIC=-2900.16
## 
## Training set error measures:
##                        ME       RMSE        MAE        MPE     MAPE      MASE
## Training set 4.912899e-05 0.08057908 0.06449987 -0.0134504 1.367553 0.7126226
##                      ACF1
## Training set -0.002289066

Statistical significance of estimated coefficients

coeftest(arima.fit) 
## 
## z test of coefficients:
## 
##         Estimate Std. Error  z value Pr(>|z|)    
## ar1    0.8550364  0.0552765  15.4683   <2e-16 ***
## ar2   -0.5516655  0.0381541 -14.4589   <2e-16 ***
## ma1   -0.0455003  0.0664709  -0.6845   0.4937    
## drift -0.0028116  0.0018708  -1.5029   0.1329    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Root plot

autoplot(arima.fit) 

Check residuals

CheckResiduals.ICAI(arima.fit, bins = 100, lag = 25)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,1) with drift
## Q* = 16.845, df = 21, p-value = 0.7205
## 
## Model df: 4.   Total lags used: 25

Fitted values and forecast

autoplot(y_TR, series = "Training", color = "blue", size=2, alpha=0.7) +
  autolayer(arima.fit$fitted, series = "Fitted") 

# Perform a future forecast 
y_frcst <- forecast(arima.fit, h=n_test)

Plot of the final part of the training set and the forecast

autoplot(subset(y_TR, start = 550), series = "Training", color = "blue", size=2, alpha=0.7) +
  autolayer(subset(arima.fit$fitted, start = 550), series = "Fitted")  + 
  autolayer(y_frcst, series ="Forecast") + 
  autolayer(y_TS, series = "Test")

Using tslm to create a baseline model

Quoting Hyndman’s fpp3: “The basic concept is that we forecast the time series of interest y assuming that it has a linear relationship with other time series x.”
The time series x may even be one of the components of y, e.g. the trend. We will explore this idea below.

We fit a linear model with the tslm function. You can read about tslm in Chapter 5 of FPP2 (Hyndman’s book, 2nd ed.). In the third edition it has been replaced with TSLM.

In this case we use the trend keyword to indicate that we use the trend as input variable.

tslm_model <- tslm(y_TR ~ trend)

Check the significance of the coefficients

summary(tslm_model)
## 
## Call:
## tslm(formula = y_TR ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.51870 -0.32676  0.04508  0.40835  1.12794 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.360e+00  3.646e-02   201.9   <2e-16 ***
## trend       -5.939e-03  7.887e-05   -75.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5152 on 798 degrees of freedom
## Multiple R-squared:  0.8766, Adjusted R-squared:  0.8765 
## F-statistic:  5670 on 1 and 798 DF,  p-value: < 2.2e-16

We can now use this linear model to do a forecast

y_tslm <- forecast(tslm_model, h=n_test)

# And explore the forecast graphically
autoplot(subset(y_TR, start = 550), series = "Training", color = "blue") +
  autolayer(y_tslm, series ="TSLM") + 
  autolayer(y_frcst, series ="Forecast", alpha = 0.25) + 
  autolayer(y_TS, series = "Test")

Zooming in on the beginning of the forecast

autoplot(subset(y_TR, start = 790), series = "Training", color = "blue") +
  autolayer(window(y_frcst$mean, end=820), series="Arima") + 
  autolayer(window(y_tslm$mean, end=820), series="tslm") +
  ylab(" ")

Model Comparison

Now we have trained two models, and we can compare them using their performance on the test set.

forecast::accuracy(y_frcst, y_TS)
##                        ME       RMSE        MAE        MPE      MAPE      MASE
## Training set 4.912899e-05 0.08057908 0.06449987 -0.0134504  1.367553 0.7126226
## Test set     5.753839e-02 0.36826130 0.28350236  0.6245963 12.864040 3.1322577
##                      ACF1 Theil's U
## Training set -0.002289066        NA
## Test set      0.974213635  6.126543
forecast::accuracy(y_tslm, y_TS)
##                         ME      RMSE       MAE       MPE      MAPE     MASE
## Training set -2.338407e-17 0.5145344 0.4199386 -1.354819  9.253299 4.639665
## Test set      2.211409e-01 0.4830432 0.3582143  8.209833 15.704776 3.957708
##                   ACF1 Theil's U
## Training set 0.9759082        NA
## Test set     0.9760039  7.475992

Some explicit formulas:

# mean(abs(residuals(tslm_model)))
# sqrt(mean(residuals(tslm_model)^2))
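The test-set columns of accuracy() can also be reproduced by hand (a sketch, assuming the y_TS, y_frcst and y_tslm objects from above; subtracting two ts objects aligns them by their time index):

```r
# Explicit test-set error measures for both models:
rmse_arima <- sqrt(mean((y_TS - y_frcst$mean)^2))
mae_arima  <- mean(abs(y_TS - y_frcst$mean))

rmse_tslm  <- sqrt(mean((y_TS - y_tslm$mean)^2))
mae_tslm   <- mean(abs(y_TS - y_tslm$mean))
```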

Why Arima?

As you can see, the error of the linear model is worse than that of the Arima model. An exploration of the residuals of the linear model shows that they are clearly not white noise.

ggtsdisplay(residuals(tslm_model))

That means that the linear model is not able to extract as much information from the time series as Arima does.

References

Hyndman, R. J., and G. Athanasopoulos. 2021. Forecasting: Principles and Practice. OTexts. https://otexts.com/fpp3/.